AUC / ROC AUC (Area Under the ROC Curve)#

AUC-ROC is a ranking metric for binary classification. It answers a simple question:

If we randomly pick one positive and one negative example, what’s the probability the model assigns a higher score to the positive one?

This notebook focuses on ROC AUC (the most common meaning of “AUC” in classification).

Goals

  • Build the ROC curve from the confusion matrix by sweeping a threshold

  • Compute AUC via the trapezoidal rule (what sklearn.metrics.auc does)

  • Implement roc_curve + roc_auc_score from scratch in NumPy

  • Visualize why AUC is threshold-free and how it relates to ranking

  • Use an AUC-inspired surrogate loss to train a simple linear model

Quick import (scikit-learn)

from sklearn.metrics import roc_auc_score, roc_curve, auc

Assumption throughout: binary labels \(y \in \{0,1\}\) and a real-valued score \(s(x)\) where larger means “more positive”.

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

np.set_printoptions(precision=4, suppress=True)
rng = np.random.default_rng(42)

1) Intuition: AUC is about ordering, not thresholds#

Suppose your model outputs a score \(s(x)\) (a probability, logit, margin, etc.).

  • If we pick a threshold \(\tau\), we turn scores into hard predictions.

  • If we sweep \(\tau\) from \(+\infty\) down to \(-\infty\), we trace out the ROC curve.

  • AUC summarizes the entire ROC curve into one number in \([0,1]\).

Two key takeaways:

  1. Only ranking matters: any strictly increasing transform \(g\) preserves AUC.

    \[\mathrm{AUC}(s) = \mathrm{AUC}(g\circ s)\]
  2. Pairwise probability interpretation:

    \[ \mathrm{AUC} = \mathbb{P}(s(X^+) > s(X^-)) + \tfrac{1}{2}\,\mathbb{P}(s(X^+) = s(X^-)) \]

    where \(X^+\) is a random positive example and \(X^-\) is a random negative example.
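The pairwise interpretation (including the \(\tfrac{1}{2}\) credit for ties) can be checked by hand on a tiny made-up example; the four scores below are chosen only so that one positive-negative pair ties:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Tiny example with one tie: positives score {3, 1}, negatives score {2, 1}.
y = np.array([1, 1, 0, 0])
s = np.array([3.0, 1.0, 2.0, 1.0])

# Enumerate all positive-negative pairs:
# (3,2) -> win, (3,1) -> win, (1,2) -> loss, (1,1) -> tie (counts half).
wins, ties, total = 0.0, 0.0, 0
for sp in s[y == 1]:
    for sn in s[y == 0]:
        total += 1
        wins += sp > sn
        ties += sp == sn

auc_pairs = (wins + 0.5 * ties) / total
print(auc_pairs)            # (2 + 0.5) / 4 = 0.625
print(roc_auc_score(y, s))  # 0.625
```

Both numbers agree because `roc_auc_score` is, by construction, exactly this pair-counting statistic.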

# A toy "score" example: overlapping score distributions
n_pos, n_neg = 400, 600

pos_scores = rng.normal(loc=1.0, scale=1.0, size=n_pos)
neg_scores = rng.normal(loc=0.0, scale=1.0, size=n_neg)

y_true = np.r_[np.ones(n_pos, dtype=int), np.zeros(n_neg, dtype=int)]
y_score = np.r_[pos_scores, neg_scores]

toy_auc = roc_auc_score(y_true, y_score)
print(f"Toy ROC AUC = {toy_auc:.3f}")

fig = px.histogram(
    {"score": y_score, "class": np.where(y_true == 1, "positive", "negative")},
    x="score",
    color="class",
    opacity=0.6,
    barmode="overlay",
    marginal="box",
    title="Toy scores: overlap between classes",
    labels={"score": "model score s(x)", "class": "true class"},
)
fig.show()
Toy ROC AUC = 0.772

2) From thresholds to the ROC curve#

Given a threshold \(\tau\), convert scores to predicted labels:

\[\hat{y}_i(\tau) = \mathbb{1}[s_i \ge \tau]\]

Confusion-matrix counts (as functions of \(\tau\)):

\[ \mathrm{TP}(\tau) = \sum_{i=1}^n \mathbb{1}[y_i=1 \land \hat{y}_i(\tau)=1] \quad \mathrm{FP}(\tau) = \sum_{i=1}^n \mathbb{1}[y_i=0 \land \hat{y}_i(\tau)=1] \]
\[ \mathrm{TN}(\tau) = \sum_{i=1}^n \mathbb{1}[y_i=0 \land \hat{y}_i(\tau)=0] \quad \mathrm{FN}(\tau) = \sum_{i=1}^n \mathbb{1}[y_i=1 \land \hat{y}_i(\tau)=0] \]

Normalize to rates:

\[ \mathrm{TPR}(\tau) = \frac{\mathrm{TP}(\tau)}{P} \qquad \mathrm{FPR}(\tau) = \frac{\mathrm{FP}(\tau)}{N} \]

where \(P\) is the number of positives and \(N\) the number of negatives.

The ROC curve is the set of points:

\[\{(\mathrm{FPR}(\tau), \mathrm{TPR}(\tau)) : \tau \in \mathbb{R}\}\]
def confusion_counts_at_threshold(y_true, y_score, threshold):
    y_true = np.asarray(y_true).astype(int)
    y_score = np.asarray(y_score)
    y_pred = (y_score >= threshold).astype(int)

    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    return tp, fp, tn, fn


for thr in [2.0, 1.0, 0.0, -1.0]:
    tp, fp, tn, fn = confusion_counts_at_threshold(y_true, y_score, threshold=thr)
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    print(f"thr={thr:>4}: TP={tp:>3} FP={fp:>3} TN={tn:>3} FN={fn:>3} | TPR={tpr:.3f} FPR={fpr:.3f}")
thr= 2.0: TP= 53 FP= 11 TN=589 FN=347 | TPR=0.133 FPR=0.018
thr= 1.0: TP=203 FP= 88 TN=512 FN=197 | TPR=0.507 FPR=0.147
thr= 0.0: TP=338 FP=299 TN=301 FN= 62 | TPR=0.845 FPR=0.498
thr=-1.0: TP=392 FP=491 TN=109 FN=  8 | TPR=0.980 FPR=0.818
fpr, tpr, thresholds = roc_curve(y_true, y_score)
roc_auc = auc(fpr, tpr)

fig = go.Figure()

fig.add_trace(
    go.Scatter(
        x=fpr,
        y=tpr,
        mode="lines",
        name=f"ROC (AUC = {roc_auc:.3f})",
        line=dict(width=3),
    )
)

fig.add_trace(
    go.Scatter(
        x=[0, 1],
        y=[0, 1],
        mode="lines",
        name="Random (AUC = 0.5)",
        line=dict(dash="dash", color="gray"),
    )
)

# Color ROC points by threshold (skip the first threshold = +inf)
fig.add_trace(
    go.Scatter(
        x=fpr[1:],
        y=tpr[1:],
        mode="markers",
        name="Threshold points",
        marker=dict(
            size=6,
            color=thresholds[1:],
            colorscale="Viridis",
            showscale=True,
            colorbar=dict(title="threshold"),
        ),
    )
)

fig.update_layout(
    title="ROC curve (TPR vs FPR) from a threshold sweep",
    xaxis_title="False Positive Rate (FPR)",
    yaxis_title="True Positive Rate (TPR)",
    width=750,
    height=520,
)

# Make the plot square-ish so the diagonal really looks like 45 degrees
fig.update_yaxes(scaleanchor="x", scaleratio=1)
fig.show()

3) AUC = area under the ROC curve (trapezoidal rule)#

Once you have ROC points \((x_k, y_k)\), with \(x_k\) the FPR and \(y_k\) the TPR, sorted by increasing FPR, you can approximate the integral using trapezoids:

\[ \mathrm{AUC} \approx \sum_{k=0}^{m-2} (x_{k+1} - x_k)\,\frac{y_{k+1}+y_k}{2} \]

This is exactly what sklearn.metrics.auc(x, y) computes (for any curve, not only ROC).

def auc_trapezoid(x, y):
    '''Compute area under a curve via the trapezoidal rule.

    Parameters
    ----------
    x, y : 1D arrays of same length
        x must be monotonically increasing.
    '''
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim != 1 or y.ndim != 1 or x.shape != y.shape:
        raise ValueError("x and y must be 1D arrays with the same shape")

    dx = np.diff(x)
    if np.any(dx < 0):
        raise ValueError("x must be monotonically increasing")

    return float(np.sum(dx * (y[1:] + y[:-1]) / 2.0))


print(f"sklearn auc(fpr,tpr)          = {auc(fpr, tpr):.10f}")
print(f"NumPy  auc_trapezoid(fpr,tpr) = {auc_trapezoid(fpr, tpr):.10f}")
sklearn auc(fpr,tpr)          = 0.7715416667
NumPy  auc_trapezoid(fpr,tpr) = 0.7715416667

4) From scratch: ROC curve and ROC AUC in NumPy#

A convenient way to build the ROC curve:

  1. Sort examples by score (descending).

  2. Sweep a threshold from high to low (equivalently: move down the sorted list).

  3. Track how many positives/negatives we’ve included so far.

  4. Record \((\mathrm{FPR},\mathrm{TPR})\) whenever the score changes.

This produces the same step-shaped ROC curve you get from scikit-learn.

def roc_curve_numpy(y_true, y_score):
    '''Compute ROC curve (FPR, TPR, thresholds) for binary classification.

    Matches the standard approach used by scikit-learn:
    - thresholds are unique score values in descending order, with an initial +inf
    - the returned curve starts at (0,0)

    Parameters
    ----------
    y_true : array-like of shape (n,)
        Binary labels in {0,1}
    y_score : array-like of shape (n,)
        Scores where larger means more positive
    '''
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    if y_true.shape != y_score.shape:
        raise ValueError("y_true and y_score must have the same shape")

    unique = np.unique(y_true)
    if not set(unique.tolist()).issubset({0, 1}):
        raise ValueError("y_true must be binary with values in {0,1}")

    y_true = y_true.astype(int)
    n_pos = int(np.sum(y_true == 1))
    n_neg = int(np.sum(y_true == 0))
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC is undefined when only one class is present")

    # Sort by score descending (stable sort so ties are handled consistently)
    order = np.argsort(-y_score, kind="mergesort")
    y_true_sorted = y_true[order]
    y_score_sorted = y_score[order]

    # Indices where score changes (each threshold is a distinct score value)
    distinct = np.where(y_score_sorted[1:] != y_score_sorted[:-1])[0]
    threshold_idxs = np.r_[distinct, y_true_sorted.size - 1]

    tps = np.cumsum(y_true_sorted)[threshold_idxs]
    fps = (threshold_idxs + 1) - tps

    tpr = tps / n_pos
    fpr = fps / n_neg
    thresholds = y_score_sorted[threshold_idxs]

    # Prepend the (0,0) point with threshold = +inf
    tpr = np.r_[0.0, tpr]
    fpr = np.r_[0.0, fpr]
    thresholds = np.r_[np.inf, thresholds]

    return fpr, tpr, thresholds


def roc_auc_score_numpy(y_true, y_score):
    fpr, tpr, _ = roc_curve_numpy(y_true, y_score)
    return auc_trapezoid(fpr, tpr)


fpr_np, tpr_np, thr_np = roc_curve_numpy(y_true, y_score)
auc_np = roc_auc_score_numpy(y_true, y_score)

print(f"sklearn roc_auc_score = {roc_auc_score(y_true, y_score):.10f}")
print(f"NumPy   roc_auc_score = {auc_np:.10f}")

# Sanity check: same AUC
assert np.isclose(auc_np, roc_auc_score(y_true, y_score))
sklearn roc_auc_score = 0.7715416667
NumPy   roc_auc_score = 0.7715416667

5) AUC as a rank statistic (Mann–Whitney / Wilcoxon)#

The “pairwise probability” view can be written explicitly as:

\[ \mathrm{AUC} = \frac{1}{PN}\sum_{i\in\mathcal{P}}\sum_{j\in\mathcal{N}} \Big(\mathbb{1}[s_i > s_j] + \tfrac{1}{2}\,\mathbb{1}[s_i = s_j]\Big) \]

There’s an equivalent computation using ranks. Let \(r_i\) be the rank of \(s_i\) among all scores (rank 1 = smallest score), using average ranks for ties. Then:

\[ \mathrm{AUC} = \frac{\sum_{i\in\mathcal{P}} r_i - \frac{P(P+1)}{2}}{PN} \]

This is useful because it can be computed in \(O(n\log n)\) via sorting.

def rankdata_average_ties(x):
    '''Rank data with average ranks for ties (1 = smallest).'''
    x = np.asarray(x)
    order = np.argsort(x, kind="mergesort")
    x_sorted = x[order]

    ranks_sorted = np.empty(x_sorted.shape[0], dtype=float)
    i = 0
    n = x_sorted.shape[0]
    while i < n:
        j = i
        while j + 1 < n and x_sorted[j + 1] == x_sorted[i]:
            j += 1

        # ranks are 1..n, positions i..j correspond to ranks (i+1)..(j+1)
        avg_rank = 0.5 * ((i + 1) + (j + 1))
        ranks_sorted[i : j + 1] = avg_rank
        i = j + 1

    ranks = np.empty_like(ranks_sorted)
    ranks[order] = ranks_sorted
    return ranks


def roc_auc_score_rank_numpy(y_true, y_score):
    y_true = np.asarray(y_true).astype(int)
    y_score = np.asarray(y_score)
    n_pos = int(np.sum(y_true == 1))
    n_neg = int(np.sum(y_true == 0))
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC AUC is undefined when only one class is present")

    ranks = rankdata_average_ties(y_score)
    sum_pos_ranks = float(np.sum(ranks[y_true == 1]))
    return (sum_pos_ranks - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


auc_rank = roc_auc_score_rank_numpy(y_true, y_score)
print(f"ROC AUC (rank formula) = {auc_rank:.10f}")
print(f"ROC AUC (ROC+trapz)    = {roc_auc_score_numpy(y_true, y_score):.10f}")
assert np.isclose(auc_rank, roc_auc_score_numpy(y_true, y_score))

# Direct pairwise computation: O(P*N), only feasible for small n
pos_idx = np.flatnonzero(y_true == 1)
neg_idx = np.flatnonzero(y_true == 0)

sub_idx = np.r_[
    rng.choice(pos_idx, size=60, replace=False),
    rng.choice(neg_idx, size=60, replace=False),
]
sub_idx = rng.permutation(sub_idx)

y_sub = y_true[sub_idx]
s_sub = y_score[sub_idx]

pos = s_sub[y_sub == 1]
neg = s_sub[y_sub == 0]

diffs = pos[:, None] - neg[None, :]
auc_pairwise = (np.sum(diffs > 0) + 0.5 * np.sum(diffs == 0)) / (pos.size * neg.size)

print(f"ROC AUC (pairwise, small) = {auc_pairwise:.10f}")
assert np.isclose(auc_pairwise, roc_auc_score_rank_numpy(y_sub, s_sub))
ROC AUC (rank formula) = 0.7715416667
ROC AUC (ROC+trapz)    = 0.7715416667
ROC AUC (pairwise, small) = 0.7977777778

Monotonic transforms don’t change AUC#

Because AUC depends only on the ordering of scores, any strictly increasing transform keeps the ROC curve (and AUC) the same.

def softplus(z):
    # stable log(1 + exp(z))
    return np.logaddexp(0.0, z)


s = y_score
s_transformed = softplus(3.0 * s)  # strictly increasing

print(f"AUC(s)            = {roc_auc_score_numpy(y_true, s):.10f}")
print(f"AUC(softplus(3s)) = {roc_auc_score_numpy(y_true, s_transformed):.10f}")
AUC(s)            = 0.7715416667
AUC(softplus(3s)) = 0.7715416667

6) A threshold view: how TPR and FPR change as you move the cutoff#

ROC plots TPR vs FPR, but sometimes it’s useful to see both rates as explicit functions of the threshold \(\tau\).

# Use the NumPy ROC implementation so we have (fpr,tpr,threshold) aligned
fpr_np, tpr_np, thr_np = roc_curve_numpy(y_true, y_score)

# Drop the first threshold (+inf) for plotting
thr_plot = thr_np[1:]
tpr_plot = tpr_np[1:]
fpr_plot = fpr_np[1:]

fig = go.Figure()
fig.add_trace(go.Scatter(x=thr_plot, y=tpr_plot, mode="lines+markers", name="TPR"))
fig.add_trace(go.Scatter(x=thr_plot, y=fpr_plot, mode="lines+markers", name="FPR"))

fig.update_layout(
    title="TPR and FPR as functions of the threshold",
    xaxis_title="threshold τ (higher = stricter)",
    yaxis_title="rate",
    width=800,
    height=450,
)

# thresholds are in descending score order; flip the x-axis so we read left→right as τ decreases
fig.update_xaxes(autorange="reversed")
fig.show()
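This threshold view also suggests a practical recipe: fix a false-positive budget and read off the strictest threshold that stays inside it. A small sketch (the 0.10 budget and the synthetic scores are illustrative, not part of the notebook's data):

```python
import numpy as np
from sklearn.metrics import roc_curve

rng = np.random.default_rng(0)
y = np.r_[np.ones(200, dtype=int), np.zeros(300, dtype=int)]
s = np.r_[rng.normal(1.0, 1.0, 200), rng.normal(0.0, 1.0, 300)]

fpr, tpr, thresholds = roc_curve(y, s)

# fpr is non-decreasing along the curve, so the last point with
# fpr <= max_fpr is the loosest threshold that respects the budget.
max_fpr = 0.10
k = np.where(fpr <= max_fpr)[0][-1]
print(f"tau={thresholds[k]:.3f}  FPR={fpr[k]:.3f}  TPR={tpr[k]:.3f}")
```

Because `roc_curve` returns thresholds in descending order alongside the rates, no re-sorting is needed.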

7) Practical usage: ROC AUC for a classifier#

In practice, you usually:

  1. Train a model with a differentiable loss (log-loss, hinge, etc.).

  2. Evaluate ROC AUC on validation/test using scores (predict_proba or decision_function).

Important: AUC expects scores, not hard class labels.
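Passing hard 0/1 predictions silently collapses the ROC curve to a single point, and the resulting "AUC" is usually much lower. A quick sketch on synthetic scores (the 0.5 cutoff and the score distributions are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
y = np.r_[np.ones(300, dtype=int), np.zeros(700, dtype=int)]
s = np.r_[rng.normal(1.2, 1.0, 300), rng.normal(0.0, 1.0, 700)]

auc_scores = roc_auc_score(y, s)                         # full ranking information
auc_labels = roc_auc_score(y, (s >= 0.5).astype(float))  # hard 0/1 predictions

print(f"AUC from scores: {auc_scores:.3f}")
print(f"AUC from labels: {auc_labels:.3f}")  # lower: massive ties at 0 and 1
```

With hard labels, every example within a class-prediction group is tied, so most positive-negative pairs only get the half-credit tie term.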

X, y = make_classification(
    n_samples=2500,
    n_features=10,
    n_informative=4,
    n_redundant=0,
    weights=[0.85, 0.15],
    class_sep=1.0,
    random_state=42,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

clf = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=2000),
)
clf.fit(X_train, y_train)

# Probability scores for the positive class
y_score_test = clf.predict_proba(X_test)[:, 1]

auc_test = roc_auc_score(y_test, y_score_test)
print(f"Test ROC AUC = {auc_test:.3f}")

fpr_test, tpr_test, _ = roc_curve(y_test, y_score_test)

fig = go.Figure()
fig.add_trace(go.Scatter(x=fpr_test, y=tpr_test, mode="lines", name=f"Test ROC (AUC={auc_test:.3f})"))
fig.add_trace(go.Scatter(x=[0, 1], y=[0, 1], mode="lines", line=dict(dash="dash", color="gray"), name="Random"))
fig.update_layout(
    title="ROC curve on a held-out test set",
    xaxis_title="False Positive Rate (FPR)",
    yaxis_title="True Positive Rate (TPR)",
    width=750,
    height=520,
)
fig.update_yaxes(scaleanchor="x", scaleratio=1)
fig.show()

# Same AUC via our NumPy implementation
print(f"NumPy ROC AUC = {roc_auc_score_numpy(y_test, y_score_test):.10f}")
Test ROC AUC = 0.733
NumPy ROC AUC = 0.7329897748

8) Using AUC for optimization (via a smooth surrogate)#

The “true” AUC objective is based on an indicator:

\[ \max_w\; \frac{1}{PN}\sum_{i\in\mathcal{P}}\sum_{j\in\mathcal{N}} \mathbb{1}[s_w(x_i) > s_w(x_j)] \]

This is hard to optimize directly because the indicator is non-differentiable and the double sum over all pairs is \(O(PN)\).

A common trick: replace the indicator with a pairwise surrogate loss. For a linear scoring model \(s_w(x)=w^\top x\), define \(d_{ij}=w^\top(x_i-x_j)\) and use the pairwise logistic loss:

\[ \ell(d) = \log(1+e^{-d}) \]

Then minimize:

\[ L(w)=\frac{1}{PN}\sum_{i\in\mathcal{P}}\sum_{j\in\mathcal{N}} \ell\big(w^\top(x_i-x_j)\big) \]

The gradient of one pair is:

\[ \nabla_w\ell\big(w^\top(x_i-x_j)\big) = -\sigma\big(-w^\top(x_i-x_j)\big)\,(x_i-x_j) \]

where \(\sigma(z)=\frac{1}{1+e^{-z}}\) is the sigmoid.

In code, we approximate the full double sum with mini-batches of random positive-negative pairs.

def sigmoid(z):
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    exp_z = np.exp(z[~pos])
    out[~pos] = exp_z / (1.0 + exp_z)
    return out


def standardize_fit(X):
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std = np.where(std == 0, 1.0, std)
    return mean, std


def standardize_transform(X, mean, std):
    return (X - mean) / std


def add_intercept(X):
    return np.concatenate([np.ones((X.shape[0], 1)), X], axis=1)


# Prepare a smaller dataset for fast training curves
X_small, y_small = make_classification(
    n_samples=1400,
    n_features=8,
    n_informative=3,
    n_redundant=0,
    weights=[0.9, 0.1],
    class_sep=0.9,
    flip_y=0.03,
    random_state=7,
)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_small, y_small, test_size=0.35, stratify=y_small, random_state=7
)

mean, std = standardize_fit(X_tr)
X_tr = standardize_transform(X_tr, mean, std)
X_va = standardize_transform(X_va, mean, std)

X_tr = add_intercept(X_tr)
X_va = add_intercept(X_va)


def train_logistic_ce(X, y, X_val, y_val, lr=0.1, reg=1e-3, epochs=30, batch_size=256, seed=0):
    rng_local = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    auc_hist = []

    for _ in range(epochs):
        idx = rng_local.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start : start + batch_size]
            xb = X[batch]
            yb = y[batch]

            logits = xb @ w
            p = sigmoid(logits)

            grad = (xb.T @ (p - yb)) / xb.shape[0] + reg * w
            w -= lr * grad

        auc_hist.append(roc_auc_score_numpy(y_val, X_val @ w))

    return w, np.array(auc_hist)


def train_auc_pairwise(
    X,
    y,
    X_val,
    y_val,
    lr=0.1,
    reg=1e-3,
    epochs=30,
    steps_per_epoch=200,
    batch_pairs=512,
    seed=0,
):
    rng_local = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)

    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    if pos_idx.size == 0 or neg_idx.size == 0:
        raise ValueError("Need both classes for AUC training")

    auc_hist = []

    for _ in range(epochs):
        for _ in range(steps_per_epoch):
            pi = rng_local.choice(pos_idx, size=batch_pairs, replace=True)
            ni = rng_local.choice(neg_idx, size=batch_pairs, replace=True)

            diff = X[pi] - X[ni]
            d_scores = diff @ w

            # loss per pair: log(1 + exp(-d)) = softplus(-d)
            # grad: -sigmoid(-d) * diff
            weight = sigmoid(-d_scores)
            grad = -(weight[:, None] * diff).mean(axis=0) + reg * w
            w -= lr * grad

        auc_hist.append(roc_auc_score_numpy(y_val, X_val @ w))

    return w, np.array(auc_hist)


w_ce, auc_ce = train_logistic_ce(
    X_tr,
    y_tr,
    X_va,
    y_va,
    lr=0.2,
    reg=1e-3,
    epochs=30,
    batch_size=256,
    seed=1,
)
w_auc, auc_auc = train_auc_pairwise(
    X_tr,
    y_tr,
    X_va,
    y_va,
    lr=0.2,
    reg=1e-3,
    epochs=30,
    steps_per_epoch=150,
    batch_pairs=512,
    seed=1,
)

print(f"Final val AUC (cross-entropy training) = {auc_ce[-1]:.3f}")
print(f"Final val AUC (pairwise AUC training)  = {auc_auc[-1]:.3f}")
Final val AUC (cross-entropy training) = 0.935
Final val AUC (pairwise AUC training)  = 0.945
fig = go.Figure()
fig.add_trace(go.Scatter(y=auc_ce, mode="lines+markers", name="Train w/ log-loss (CE)"))
fig.add_trace(go.Scatter(y=auc_auc, mode="lines+markers", name="Train w/ pairwise AUC surrogate"))

fig.update_layout(
    title="Validation ROC AUC during training",
    xaxis_title="epoch",
    yaxis_title="ROC AUC",
    width=850,
    height=450,
    yaxis=dict(range=[0.45, 1.0]),
)
fig.show()

9) Pros, cons, and when to use ROC AUC#

Pros

  • Threshold-free: summarizes performance across all possible thresholds.

  • Ranking interpretation: \(\mathrm{AUC}\approx \mathbb{P}(s(X^+) > s(X^-))\) is intuitive.

  • Insensitive to score calibration: if your ranking is good but probabilities aren’t calibrated, AUC can still be high.

  • Works well for model comparison when you don’t yet know the operating threshold.

Cons / pitfalls

  • Not about calibration: a model can have great AUC and terrible probability estimates.

  • Can look optimistic under heavy class imbalance when precision on the positive class is what actually matters (consider PR AUC).

  • Averages over regions you may not care about (e.g., you might only tolerate FPR < 1%).

  • Hard to optimize directly: the exact AUC objective is non-smooth and pairwise.

  • Depends on label meaning and score direction: flipping the positive class changes AUC to \(1-\mathrm{AUC}\).
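The last point is easy to verify numerically, since relabeling the positive class as negative reverses every pairwise comparison. A quick check with arbitrary synthetic labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
y = rng.integers(0, 2, size=500)
s = y + rng.normal(0.0, 1.5, size=500)  # noisy scores correlated with y

auc = roc_auc_score(y, s)
auc_flipped = roc_auc_score(1 - y, s)   # swap which class counts as positive

print(f"AUC          = {auc:.3f}")
print(f"AUC, flipped = {auc_flipped:.3f}")
print(f"sum          = {auc + auc_flipped:.3f}")  # exactly 1.0
```

The identity \(\mathrm{AUC}(1-y, s) = 1 - \mathrm{AUC}(y, s)\) holds exactly, ties included, because flipping labels turns each "win" into a "loss" while ties stay ties.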

Good use cases

  • When you care about ranking (who is more likely positive) more than a single fixed decision threshold.

  • When you’ll choose the operating threshold later (business constraints, costs, etc.).

  • As a model-selection metric for classifiers that output usable scores.

10) Exercises#

  1. Create a highly imbalanced dataset (e.g. 1% positives) and compare ROC AUC vs PR AUC.

  2. Show two models with the same ROC AUC but very different TPR when FPR < 0.01.

  3. Extend the NumPy ROC code to support sample weights.

  4. For the pairwise AUC training, try a hinge loss \(\ell(d)=\max(0, 1-d)\) and compare curves.

References#

  • scikit-learn: roc_curve, roc_auc_score, auc

    • https://scikit-learn.org/stable/modules/model_evaluation.html

    • https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html

    • https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_curve.html

    • https://scikit-learn.org/stable/modules/generated/sklearn.metrics.auc.html

  • Tom Fawcett (2006): An introduction to ROC analysis

  • Wilcoxon / Mann–Whitney rank-sum statistic interpretation of AUC